{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# [Transformer 的实现细节](https://zhuanlan.zhihu.com/p/347709112)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. 包的导入"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "import math\n",
    "import importlib\n",
    "\n",
    "import spacy\n",
    "import torch\n",
    "import torch.nn as nn\n",
    "from torch.autograd import Variable"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip3 install https://github.com/explosion/spacy-models/releases/download/zh_core_web_sm-2.3.0/zh_core_web_sm-2.3.0.tar.gz"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "## 2. 使用 Spacy 构建分词器\n",
    "\n",
    "首先，我们要对输入的语句做分词，这里我使用 spacy 来完成这件事，你也可以选择你喜欢的工具来做。\n",
    "\n",
    "![](https://pic1.zhimg.com/80/v2-471a3ca135eb8675aa675b7bd4c5af00_1440w.jpg)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "class Tokenize(object):\n",
    "    \n",
    "    def __init__(self, lang):\n",
    "        self.nlp = importlib.import_module(lang).load()\n",
    "            \n",
    "    def tokenizer(self, sentence):\n",
    "        sentence = re.sub(\n",
    "        r\"[\\*\\\"“”\\n\\\\…\\+\\-\\/\\=\\(\\)‘•:\\[\\]\\|’\\!;]\", \" \", str(sentence))\n",
    "        sentence = re.sub(r\"[ ]+\", \" \", sentence)\n",
    "        sentence = re.sub(r\"\\!+\", \"!\", sentence)\n",
    "        sentence = re.sub(r\"\\,+\", \",\", sentence)\n",
    "        sentence = re.sub(r\"\\?+\", \"?\", sentence)\n",
    "        sentence = sentence.lower()\n",
    "        return [tok.text for tok in self.nlp.tokenizer(sentence) if tok.text != \" \"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['你好', '，', '这里', '是', '中国', '。']"
      ]
     },
     "execution_count": 43,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tokenize = Tokenize('zh_core_web_sm')\n",
    "tokenize.tokenizer('你好，这里是中国。')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Input Embedding"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3.1 Token Embedding\n",
    "\n",
    "给语句分词后，我们就得到了一个个的 token，我们之前有说过，要对这些token做向量化的表示，这里我们使用 pytorch 中torch.nn.Embedding 让模型学习到这些向量。\n",
    "\n",
    "![](https://pic3.zhimg.com/80/v2-e81ede6421578538abff90810b840006_1440w.jpg)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {},
   "outputs": [],
   "source": [
    "class Embedding(nn.Module):\n",
    "    \n",
    "    def __init__(self, vocab_size, d_model):\n",
    "        super().__init__()\n",
    "        self.d_model = d_model\n",
    "        self.embed = nn.Embedding(vocab_size, d_model)\n",
    "        \n",
    "    def forward(self, x):\n",
    "        return self.embed(x)"
   ]
  },
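  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick illustration (the vocabulary size and token ids below are made up, not from the article), the `Embedding` layer simply maps each token id to a learned `d_model`-dimensional vector:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Illustrative only: made-up vocabulary size and token ids\n",
    "emb = Embedding(vocab_size=1000, d_model=512)\n",
    "token_ids = torch.tensor([[3, 14, 159]])   # (batch=1, seq_len=3)\n",
    "print(emb(token_ids).shape)                # torch.Size([1, 3, 512])"
   ]
  },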
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3.2 Positional Encoder\n",
    "\n",
    "前文中，我们有说过，要把 token 在句子中的顺序也加入到模型中，让模型进行学习。这里我们使用的是 position encodings 的方法。\n",
    "\n",
    "![](https://pic1.zhimg.com/80/v2-52029c789f5246ca48445ee612075230_1440w.png)\n",
    "\n",
    "![](https://pic3.zhimg.com/80/v2-c67c996b7a26a34168d562e53465d9f6_1440w.png)\n",
    "\n",
    "![](https://pic3.zhimg.com/80/v2-ea835d0d07cc9b410b4de371b1c78132_1440w.jpg)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {},
   "outputs": [],
   "source": [
    "class PositionalEncoder(nn.Module):\n",
    "    \n",
    "    def __init__(self, d_model, max_seq_len = 80):\n",
    "        super().__init__()\n",
    "        self.d_model = d_model\n",
    "        \n",
    "        # 根据pos和i创建一个常量pe矩阵\n",
    "        pe = torch.zeros(max_seq_len, d_model)\n",
    "        for pos in range(max_seq_len):\n",
    "            for i in range(0, d_model, 2):\n",
    "                pe[pos, i] = \\\n",
    "                math.sin(pos / (10000 ** ((2 * i)/d_model)))\n",
    "                pe[pos, i + 1] = \\\n",
    "                math.cos(pos / (10000 ** ((2 * (i + 1))/d_model)))\n",
    "                \n",
    "        pe = pe.unsqueeze(0)\n",
    "        self.register_buffer('pe', pe)\n",
    " \n",
    "    def forward(self, x):\n",
    "        # 让 embeddings vector 相对大一些\n",
    "        x = x * math.sqrt(self.d_model)\n",
    "        # 增加位置常量到 embedding 中\n",
    "        seq_len = x.size(1)\n",
    "        x = x + Variable(self.pe[:,:seq_len], \\\n",
    "                         requires_grad=False).cuda()\n",
    "        return x"
   ]
  },
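  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A quick sanity check (dummy tensor, illustrative only): the positional encoder leaves the shape unchanged and just adds the same position signal to every sequence in the batch."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Illustrative shape check with a dummy embedding tensor\n",
    "pos_enc = PositionalEncoder(d_model=512)\n",
    "dummy = torch.zeros(2, 10, 512)      # (batch, seq_len, d_model)\n",
    "print(pos_enc(dummy).shape)          # torch.Size([2, 10, 512])"
   ]
  },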
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Transformer Block\n",
    "\n",
    "有了输入，我们接下来就要开始构建 Transformer Block 了，Transformer Block 主要是有以下4个部分构成的：\n",
    "\n",
    "self-attention layer\n",
    "normalization layer\n",
    "feed forward layer\n",
    "another normalization layer\n",
    "它们之间使用残差网络进行连接，详细在上文同一个图下有描述，这里就不再赘述了。\n",
    "\n",
    "![](https://pic3.zhimg.com/80/v2-b228cbbb9fa7732ee6dfd2abb743b062_1440w.jpg)\n",
    "\n",
    "Attention 和 Self-attention 在前面的两篇文章中有详细的描述，不太了解的话，可以跳过去看看。\n",
    "\n",
    "[Transformer 一篇就够了（一）： Self-attenstion](https://zhuanlan.zhihu.com/p/345680792)\n",
    "\n",
    "[Transformer 一篇就够了（二）： Transformer中的Self-attenstion](https://zhuanlan.zhihu.com/p/347492368)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4.1 Attention\n",
    "\n",
    "![](https://pic2.zhimg.com/80/v2-b900fb952a100acd7dd8cd65ebd8bd61_1440w.gif)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {},
   "outputs": [],
   "source": [
    "def attention(q, k, v, d_k, mask=None, dropout=None):\n",
    "    \n",
    "    scores = torch.matmul(q, k.transpose(-2, -1)) /  math.sqrt(d_k)\n",
    "    \n",
    "    # mask掉那些为了padding长度增加的token，让其通过softmax计算后为0\n",
    "    if mask is not None:\n",
    "        mask = mask.unsqueeze(1)\n",
    "        scores = scores.masked_fill(mask == 0, -1e9)\n",
    "    \n",
    "    scores = F.softmax(scores, dim=-1)\n",
    "    \n",
    "    if dropout is not None:\n",
    "        scores = dropout(scores)\n",
    "        \n",
    "    output = torch.matmul(scores, v)\n",
    "    return output"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> 这个 attention 的代码中，使用 mask 的机制，这里主要的意思是因为在去给文本做 batch化的过程中，需要序列都是等长的，不足的部分需要 padding。但是这些 padding 的部分，我们并不想在计算的过程中起作用，所以使用 mask 机制，将这些值设置成一个非常大的负值，这样才能让 softmax 后的结果为0。关于 mask 机制，在 Transformer 中有 attention、encoder 和 decoder 中，有不同的应用，我会在后面的文章中进行解释。\n",
    "\n",
    "![](https://pic1.zhimg.com/v2-b121c4e5aabb704d1c7d87032cb28544_b.png)"
   ]
  },
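  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Mask construction itself is left to a later article. As a rough illustration (the helper names and the `PAD` index below are hypothetical, not from the article), a padding mask marks real tokens with 1 and padded positions with 0, while a \"subsequent\" mask additionally stops a position from attending to later positions:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Hypothetical helpers for illustration; the names and PAD index are not from the article\n",
    "PAD = 0  # assumed index of the padding token\n",
    "\n",
    "def create_padding_mask(seq):\n",
    "    # (batch, seq_len) token ids -> (batch, 1, seq_len) mask: 1 for real tokens, 0 for padding\n",
    "    return (seq != PAD).unsqueeze(-2)\n",
    "\n",
    "def create_subsequent_mask(size):\n",
    "    # (1, size, size) lower-triangular mask: position i may only attend to positions <= i\n",
    "    return torch.tril(torch.ones(1, size, size, dtype=torch.uint8))\n",
    "\n",
    "print(create_padding_mask(torch.tensor([[5, 7, 9, PAD, PAD]])))\n",
    "print(create_subsequent_mask(4))"
   ]
  },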
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4.2 MultiHeadAttention\n",
    "\n",
    "多头的注意力机制，用来识别数据之间的不同联系，前文中的第二篇也已经聊过了。\n",
    "\n",
    "![](https://pic1.zhimg.com/v2-2b3afd542c3a01dc3153e365ca8349a0_b.png)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {},
   "outputs": [],
   "source": [
    "class MultiHeadAttention(nn.Module):\n",
    "    \n",
    "    def __init__(self, heads, d_model, dropout = 0.1):\n",
    "        super().__init__()\n",
    "        \n",
    "        self.d_model = d_model\n",
    "        self.d_k = d_model // heads\n",
    "        self.h = heads\n",
    "        \n",
    "        self.q_linear = nn.Linear(d_model, d_model)\n",
    "        self.v_linear = nn.Linear(d_model, d_model)\n",
    "        self.k_linear = nn.Linear(d_model, d_model)\n",
    "        \n",
    "        self.dropout = nn.Dropout(dropout)\n",
    "        self.out = nn.Linear(d_model, d_model)\n",
    "    \n",
    "    def forward(self, q, k, v, mask=None):\n",
    "        \n",
    "        bs = q.size(0)\n",
    "        \n",
    "        # perform linear operation and split into N heads\n",
    "        k = self.k_linear(k).view(bs, -1, self.h, self.d_k)\n",
    "        q = self.q_linear(q).view(bs, -1, self.h, self.d_k)\n",
    "        v = self.v_linear(v).view(bs, -1, self.h, self.d_k)\n",
    "        \n",
    "        # transpose to get dimensions bs * N * sl * d_model\n",
    "        k = k.transpose(1,2)\n",
    "        q = q.transpose(1,2)\n",
    "        v = v.transpose(1,2)\n",
    "        \n",
    "\n",
    "        # calculate attention using function we will define next\n",
    "        scores = attention(q, k, v, self.d_k, mask, self.dropout)\n",
    "        # concatenate heads and put through final linear layer\n",
    "        concat = scores.transpose(1,2).contiguous()\\\n",
    "        .view(bs, -1, self.d_model)\n",
    "        output = self.out(concat)\n",
    "    \n",
    "        return output"
   ]
  },
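  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A quick shape check with random data (illustrative only): with `d_model = 512` and 8 heads, each head works on 64-dimensional queries, keys and values, and the output has the same shape as the input."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Illustrative shape check with random data\n",
    "mha = MultiHeadAttention(heads=8, d_model=512)\n",
    "x = torch.rand(2, 10, 512)        # (batch, seq_len, d_model)\n",
    "print(mha(x, x, x).shape)         # torch.Size([2, 10, 512])"
   ]
  },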
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4.3 Norm Layer\n",
    "\n",
    "这里使用 Layer Norm 来使得梯度更加的平稳，关于为什么选择 Layer Norm 而不是选择其他的方法，有篇论文对此做了一些研究，Rethinking Batch Normalization in Transformers，对这个有兴趣的可以看看这篇文章。\n",
    "\n",
    "[](https://link.zhihu.com/?target=https%3A//arxiv.org/pdf/2003.07845.pdf)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [],
   "source": [
    "class NormLayer(nn.Module):\n",
    "    \n",
    "    def __init__(self, d_model, eps = 1e-6):\n",
    "        super().__init__()\n",
    "    \n",
    "        self.size = d_model\n",
    "        \n",
    "        # create two learnable parameters to calibrate normalisation\n",
    "        self.alpha = nn.Parameter(torch.ones(self.size))\n",
    "        self.bias = nn.Parameter(torch.zeros(self.size))\n",
    "        \n",
    "        self.eps = eps\n",
    "    \n",
    "    def forward(self, x):\n",
    "        norm = self.alpha * (x - x.mean(dim=-1, keepdim=True)) \\\n",
    "        / (x.std(dim=-1, keepdim=True) + self.eps) + self.bias\n",
    "        return norm"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4.4 Feed Forward Layer"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {},
   "outputs": [],
   "source": [
    "class FeedForward(nn.Module):\n",
    "    \n",
    "    def __init__(self, d_model, d_ff=2048, dropout = 0.1):\n",
    "        super().__init__() \n",
    "    \n",
    "        # We set d_ff as a default to 2048\n",
    "        self.linear_1 = nn.Linear(d_model, d_ff)\n",
    "        self.dropout = nn.Dropout(dropout)\n",
    "        self.linear_2 = nn.Linear(d_ff, d_model)\n",
    "    \n",
    "    def forward(self, x):\n",
    "        x = self.dropout(F.relu(self.linear_1(x)))\n",
    "        x = self.linear_2(x)"
   ]
  },
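  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4.5 Encoder Layer and Decoder Layer\n",
    "\n",
    "The `Encoder` and `Decoder` in the next sections stack `EncoderLayer` and `DecoderLayer` blocks, which the article does not spell out. Below is a minimal sketch built from the `MultiHeadAttention`, `FeedForward` and `NormLayer` modules defined above. It uses pre-norm residual connections (normalize, apply the sub-layer, then add the result back to the input), which is one common way to wire the block; the original paper normalizes after the residual addition instead, so treat this as one reasonable variant rather than the definitive implementation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "class EncoderLayer(nn.Module):\n",
    "    \n",
    "    def __init__(self, d_model, heads, dropout=0.1):\n",
    "        super().__init__()\n",
    "        self.norm_1 = NormLayer(d_model)\n",
    "        self.norm_2 = NormLayer(d_model)\n",
    "        self.attn = MultiHeadAttention(heads, d_model, dropout=dropout)\n",
    "        self.ff = FeedForward(d_model, dropout=dropout)\n",
    "        self.dropout_1 = nn.Dropout(dropout)\n",
    "        self.dropout_2 = nn.Dropout(dropout)\n",
    "        \n",
    "    def forward(self, x, mask):\n",
    "        # self-attention sub-layer with a residual connection\n",
    "        x2 = self.norm_1(x)\n",
    "        x = x + self.dropout_1(self.attn(x2, x2, x2, mask))\n",
    "        # feed-forward sub-layer with a residual connection\n",
    "        x2 = self.norm_2(x)\n",
    "        x = x + self.dropout_2(self.ff(x2))\n",
    "        return x\n",
    "\n",
    "\n",
    "class DecoderLayer(nn.Module):\n",
    "    \n",
    "    def __init__(self, d_model, heads, dropout=0.1):\n",
    "        super().__init__()\n",
    "        self.norm_1 = NormLayer(d_model)\n",
    "        self.norm_2 = NormLayer(d_model)\n",
    "        self.norm_3 = NormLayer(d_model)\n",
    "        self.attn_1 = MultiHeadAttention(heads, d_model, dropout=dropout)\n",
    "        self.attn_2 = MultiHeadAttention(heads, d_model, dropout=dropout)\n",
    "        self.ff = FeedForward(d_model, dropout=dropout)\n",
    "        self.dropout_1 = nn.Dropout(dropout)\n",
    "        self.dropout_2 = nn.Dropout(dropout)\n",
    "        self.dropout_3 = nn.Dropout(dropout)\n",
    "        \n",
    "    def forward(self, x, e_outputs, src_mask, trg_mask):\n",
    "        # masked self-attention over the target sequence\n",
    "        x2 = self.norm_1(x)\n",
    "        x = x + self.dropout_1(self.attn_1(x2, x2, x2, trg_mask))\n",
    "        # attention over the encoder outputs\n",
    "        x2 = self.norm_2(x)\n",
    "        x = x + self.dropout_2(self.attn_2(x2, e_outputs, e_outputs, src_mask))\n",
    "        # feed-forward sub-layer with a residual connection\n",
    "        x2 = self.norm_3(x)\n",
    "        x = x + self.dropout_3(self.ff(x2))\n",
    "        return x"
   ]
  },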
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. Encoder\n",
    "\n",
    "Encoder 就是将上面讲解的内容，按照下图堆叠起来，完成将源编码到中间编码的转换。\n",
    "\n",
    "![](https://pic3.zhimg.com/v2-d2033d0d3b33d7f61bf2febc26ea5cce_b.png)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_clones(module, N):\n",
    "    return nn.ModuleList([copy.deepcopy(module) for i in range(N)])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "metadata": {},
   "outputs": [],
   "source": [
    "class Encoder(nn.Module):\n",
    "    \n",
    "    def __init__(self, vocab_size, d_model, N, heads, dropout):\n",
    "        super().__init__()\n",
    "        self.N = N\n",
    "        self.embed = Embedder(vocab_size, d_model)\n",
    "        self.pe = PositionalEncoder(d_model, dropout=dropout)\n",
    "        self.layers = get_clones(EncoderLayer(d_model, heads, dropout), N)\n",
    "        self.norm = Norm(d_model)\n",
    "        \n",
    "    def forward(self, src, mask):\n",
    "        x = self.embed(src)\n",
    "        x = self.pe(x)\n",
    "        for i in range(self.N):\n",
    "            x = self.layers[i](x, mask)\n",
    "        return self.norm(x)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 6. Decoder\n",
    "\n",
    "Decoder部分和 Encoder 的部分非常的相似，它主要是把 Encoder 生成的中间编码，转换为目标编码。后面我会在具体的任务中，来分析它和 Encoder 的不同。\n",
    "\n",
    "![](https://pic1.zhimg.com/v2-a23cc398fd8f44225208bbbc2b25ec48_b.png)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [],
   "source": [
    "class Decoder(nn.Module):\n",
    "    \n",
    "    def __init__(self, vocab_size, d_model, N, heads, dropout):\n",
    "        super().__init__()\n",
    "        self.N = N\n",
    "        self.embed = Embedder(vocab_size, d_model)\n",
    "        self.pe = PositionalEncoder(d_model, dropout=dropout)\n",
    "        self.layers = get_clones(DecoderLayer(d_model, heads, dropout), N)\n",
    "        self.norm = Norm(d_model)\n",
    "        \n",
    "    def forward(self, trg, e_outputs, src_mask, trg_mask):\n",
    "        x = self.embed(trg)\n",
    "        x = self.pe(x)\n",
    "        for i in range(self.N):\n",
    "            x = self.layers[i](x, e_outputs, src_mask, trg_mask)\n",
    "        return self.norm(x)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 7. Transformer\n",
    "\n",
    "![](https://pic3.zhimg.com/v2-a835a5deeae6f3458a0c8282d9a16e56_b.png)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "metadata": {},
   "outputs": [],
   "source": [
    "class Transformer(nn.Module):\n",
    "    \n",
    "    def __init__(self, src_vocab, trg_vocab, d_model, N, heads, dropout):\n",
    "        super().__init__()\n",
    "        self.encoder = Encoder(src_vocab, d_model, N, heads, dropout)\n",
    "        self.decoder = Decoder(trg_vocab, d_model, N, heads, dropout)\n",
    "        self.out = nn.Linear(d_model, trg_vocab)\n",
    "        \n",
    "    def forward(self, src, trg, src_mask, trg_mask):\n",
    "        e_outputs = self.encoder(src, src_mask)\n",
    "        d_output = self.decoder(trg, e_outputs, src_mask, trg_mask)\n",
    "        output = self.out(d_output)\n",
    "        return output"
   ]
  },
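  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, a quick end-to-end smoke test. The hyperparameters, sequence lengths and all-ones masks below are made up purely to check that the tensor shapes line up; a real training setup would use proper padding and look-ahead masks and a much larger vocabulary."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Made-up hyperparameters and dummy data, purely to check that the pieces fit together\n",
    "src_vocab, trg_vocab = 1000, 1000\n",
    "model = Transformer(src_vocab, trg_vocab, d_model=512, N=2, heads=8, dropout=0.1)\n",
    "\n",
    "src = torch.randint(0, src_vocab, (2, 10))   # batch of 2 source sequences of length 10\n",
    "trg = torch.randint(0, trg_vocab, (2, 12))   # batch of 2 target sequences of length 12\n",
    "src_mask = torch.ones(2, 1, 10)              # placeholder: keep every source position\n",
    "trg_mask = torch.ones(2, 12, 12)             # placeholder: no look-ahead masking here\n",
    "\n",
    "out = model(src, trg, src_mask, trg_mask)\n",
    "print(out.shape)                             # torch.Size([2, 12, 1000])"
   ]
  },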
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}